This notebook documents the exploratory analysis and prompt development used to extract structured data from AAC accident report text. It covers three areas:
The extraction prompt and pipeline used in production are in r/extract_report_data.R. See exploratory_analysis/recent_accident_analysis.html for analysis of the full extracted dataset.
Start by visualizing how many articles we are working with and what kind of usable metadata we have.
This analysis calls the Claude API and uses the Haiku 4.5 model to read each accident report and create a JSON file that includes the risk factors and other key data in a structured format.
## additional libraries ----
library(httr2)
library(jsonlite)
library(dplyr)
library(readr)
library(glue)
library(stringr) # provides str_remove_all(), used in call_claude()
`%||%` <- function(x, y) if (is.null(x)) y else x
## api configuration ----
input_file <- "data/article_text_20260214.csv"
output_file <- "data/article_extracted.csv"
model <- "claude-haiku-4-5-20251001"
delay_sec <- 0.6 # stay within API rate limits
api_key <- Sys.getenv("ANTHROPIC_API_KEY")
## Call Claude and return parsed list ----
call_claude <- function(body_text) {
  # Build and send the request; system_prompt is defined in the prompt section
  resp <- request("https://api.anthropic.com/v1/messages") %>%
    req_headers(
      "x-api-key" = api_key,
      "anthropic-version" = "2023-06-01"
    ) %>%
    req_body_json(list(
      model = model,
      max_tokens = 1024,
      system = system_prompt,
      messages = list(
        list(role = "user", content = body_text)
      )
    ), auto_unbox = TRUE) %>%
    req_error(is_error = \(r) FALSE) %>% # handle errors manually
    req_perform()

  # Fail loudly on any non-200 response, including the response body for context
  if (resp_status(resp) != 200) {
    stop("API error ", resp_status(resp), ": ", resp_body_string(resp))
  }

  # Pull the text of the first content block and strip any Markdown code fence
  raw_text <- resp %>%
    resp_body_json() %>%
    .[["content"]] %>%
    .[[1]] %>%
    .[["text"]] %>%
    str_remove_all("^```json\\s*|^```\\s*|\\s*```$")

  fromJSON(raw_text)
}
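The fence-stripping step at the end of call_claude() can be checked offline. The sketch below applies the same regular expression to a made-up reply; the reply text and its field values are hypothetical and are not taken from any real API response.

```r
library(stringr)
library(jsonlite)

# Hypothetical model reply wrapped in a Markdown code fence
raw_reply <- "```json\n{\"route_name\": \"Example Route\", \"risk_factors\": [\"Fatigue\", \"Late Start\"]}\n```"

# Same pattern as in call_claude(): drop a leading ```json or ``` and a trailing ```
clean_reply <- str_remove_all(raw_reply, "^```json\\s*|^```\\s*|\\s*```$")

# Parse the remaining text as JSON
parsed <- fromJSON(clean_reply)
parsed$route_name
parsed$risk_factors
```

If the model returns any text before or after the fenced block, the pattern will not remove it and fromJSON() will raise an error, so a failure here is a useful signal that the reply did not follow the "Return only the JSON object" instruction.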
This analysis builds on an NLP analysis that Eliot Caroom did in 2020, and uses the risk categories that he defined. His analysis can be found on GitHub, and a summary of the results was included in Rock and Ice magazine:
Caroom, Eliot. Climbing Accidents Data Repository: Analyzing 30 Years of Accident Reports. Rock & Ice Issue 265, September 2020, pages 18-23.
Eliot’s risk factors differ from the ones used in official AAC publications, but have substantial overlap. Note that the risk category definitions have been streamlined to use tokens more efficiently, and don’t exactly match the definitions cited in Eliot’s documentation.
## prompt ----
system_prompt <- "You are an expert analyst of mountaineering accident reports.
Extract structured information from the report and return it as a JSON object
with exactly these fields:
- accident_date: date of the accident in YYYY-MM-DD format, or null if unknown
- time_of_day: one of 'morning', 'afternoon', 'evening', 'night', 'unknown'
- location_country: country where the accident occurred
- location_state_region: state, province, or region (null if unknown)
- location_peak_area: specific mountain, peak, or climbing area name (null if unknown)
- route_name: name of the specific climbing route (null if not mentioned)
- route_difficulty: grade of the climb, likely matches one of these styles:
- '5.10a PG'
- '5.4'
- '5.9X'
- '4th Class'
- 'M4'
- 'WI4'
- 'C1'
- 'A4'
- '6b'
- 'V12'
- risk_factors: array of strings describing risk factors that contribute to the accident; strings must be one of the following:
- 'Piton/Ice Screw'
- 'Ascent Illness': HAPE, HACE, AMS, or ascending too fast.
- 'Crampon Issues': Any crampon difficulty — clearing balled snow, putting on/taking off, or misuse (e.g. glissading with crampons).
- 'Glissading'
- 'Ski-related': Only when skiing at time of accident; not applied when skis are off.
- 'Poor Position'
- 'Visibility': Dark, whiteout, or snowblind at time of accident (not during rescue). Includes being late in the day with diminishing light.
- 'Severe Weather / Act of God': Includes lightning.
- 'Natural Rockfall': Rockfall not caused by humans; excludes objects dislodged by climbing parties.
- 'Wildlife'
- 'Avalanche'
- 'Poor Cond/Seasonal Risk'
- 'Cornice / Snow Bridge Collapse'
- 'Bergschrund'
- 'Crevasse / Moat / Bergschrund'
- 'Icefall / Serac / Ice Avalanche'
- 'Exposure'
- 'Non-Ascent Illness'
- 'Off-route': Straying from the intended route; excludes failure to follow ranger/guide directions.
- 'Rushed'
- 'Run Out'
- 'Crowds'
- 'Inadequate Food/Water'
- 'No Helmet'
- 'Late in Day'
- 'Late Start'
- 'Party Separated'
- 'Ledge Fall': Injurious landing on a ledge; excludes breaking ledges (see Handhold/Foothold Broke) and incidental/fortuitous landings.
- 'Gym / Artificial'
- 'Gym Climber'
- 'Fatigue'
- 'Large Group'
- 'Distracted'
- 'Object Dropped/Dislodged': Objects dropped or dislodged by climbing parties; includes dropped rope and gear. Excludes natural rockfall.
- 'Handhold/Foothold Broke'
- 'Knot & Tie-in Error'
- 'No Backup or End Knot'
- 'Gear Broke'
- 'Intoxicated'
- 'Inadequate Equipment': Missing or insufficient clothing/gear; excludes helmet (has its own category).
- 'Inadequate Protection / Pulled': No or insufficient protection placed.
- 'Anchor Failure / Error': Errors building or failures of anchors; can co-occur with Rappel/Lowering Error.
- 'Stranded / Lost / Overdue'
- 'Belay Error'
- 'Rappel Error'
- 'Lowering Error'
- 'Miscommunication'
- 'Pendulum'
- climbing_style: array of strings describing the climbing activity at the point of the accident; strings must be one of the following:
- 'Descent'
- 'Roped'
- 'Trad Climbing'
- 'Sport'
- 'Top-Rope'
- 'Aid & Big Wall Climbing'
- 'Unroped': Includes glissade and self-arrest incidents.
- 'Solo': Includes self-belayed climbing.
- 'Climbing Alone'
- 'Bouldering'
- 'Non-climbing'
- 'Alpine/Mountaineering'
- 'Ice Climbing'
- party_members: a nested object containing the following fields:
- name: the climber's name (only include people involved in the incident)
- age: the climber's age as a number (null if unknown)
- status: one of 'no injury', 'minor injury', 'serious injury', 'fatal injury', 'unknown'
Return only the JSON object, no other text."
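Because the prompt restricts risk_factors to a fixed list of strings, one simple check on a parsed reply is to look for labels outside that list. The sketch below is a minimal illustration: allowed_risk_factors is an abbreviated subset of the labels, and find_unknown_labels() and example_result are hypothetical names introduced only for this example.

```r
# Abbreviated, illustrative subset of the allowed risk_factors labels
allowed_risk_factors <- c("Fatigue", "Late Start", "No Helmet", "Rappel Error", "Avalanche")

# Hypothetical helper: return any labels that are not in the allowed vocabulary
find_unknown_labels <- function(labels, allowed) {
  setdiff(unlist(labels), allowed)
}

# Hypothetical parsed result with one label the prompt does not define
example_result <- list(risk_factors = c("Fatigue", "Bad Luck"))

unknown <- find_unknown_labels(example_result$risk_factors, allowed_risk_factors)
if (length(unknown) > 0) {
  message("Labels outside the prompt vocabulary: ", paste(unknown, collapse = ", "))
}
```

In practice the allowed vector would need to list every label in the prompt, spelled exactly as it appears there.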
## Test prompt with select articles ----
reports <- articles_tagged %>%
filter(!grepl("know the ropes", tolower(title))) %>%
filter(article_type %in% c("Accident Report", "Other - Accident Mention")) %>%
mutate(
all_text = paste(
title,
subtitle,
if_else(nchar(author) > 5, glue("Author: {author}"), ""),
# a four-digit year never satisfies nchar() > 5, so test for at least four characters
if_else(nchar(publication_year) >= 4, glue("Publication Year: {publication_year}"), ""),
if_else(nchar(climb_year) >= 4, glue("Climb Year: {climb_year}"), ""),
body_text,
sep = "\n"
)
)
test_text <- as.character(paste(reports[10,12]))
# test_text
# test_result <- call_claude(test_text)
# write_json(test_result, path = paste0("data/api_archive/test_result_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".json"), auto_unbox = TRUE, pretty = TRUE)
Link to the article referenced: https://publications.americanalpineclub.org/articles/13201217355
## prompt ----
system_prompt <- "You are an expert analyst of mountaineering accident reports.
Extract structured information from the report and return it as a JSON object
with exactly these fields:
- accident_date: date of the accident in YYYY-MM-DD format, or null if unknown
- time_of_day: one of 'morning', 'afternoon', 'evening', 'night', 'unknown'
- location_country: country where the accident occurred
- location_state_region: state, province, or region (null if unknown)
- location_peak_area: specific mountain, peak, or climbing area name (null if unknown)
- route_name: name of the specific climbing route (null if not mentioned)
- route_difficulty: grade of the climb, likely matches one of these styles:
- '5.10a PG'
- '5.4'
- '5.9X'
- '4th Class'
- 'M4'
- 'WI4'
- 'C1'
- 'A4'
- '6b'
- 'V12'
- immediate_cause: array of strings describing risk factors that directly caused the accident; strings must be one of the following:
- 'Fall on Rock'
- 'Fall on Ice'
- 'Fall on Snow'
- 'Falling Rock, Ice, Object'
- 'Illness'
- 'Stranded / Lost'
- 'Avalanche'
- 'Rappel Failure / Error'
- 'Lowering Error'
- 'Fall from Anchor'
- 'Anchor Failure'
- 'Exposure'
- 'Glissade Error'
- 'Protection Pulled Out'
- 'Failure to Follow Route'
- 'Fall into Crevasse / Moat'
- 'Faulty use of Crampons'
- 'Ascending too Fast'
- 'Skiing'
- 'Lightning'
- 'Equipment Failure'
- 'Unknown'
- objective_risk_factors: array of strings describing the environmental risk factors that contributed to the accident; strings must be one of the following:
- 'Visibility': Dark, whiteout, or snowblind at time of accident (not during rescue). Includes diminishing light late in the day.
- 'Severe Weather / Act of God': Includes lightning.
- 'Natural Rockfall': Rockfall not caused by humans; excludes objects dislodged by climbing parties.
- 'Wildlife'
- 'Poor Cond/Seasonal Risk'
- 'Cornice / Snow Bridge Collapse'
- 'Crevasse / Moat / Bergschrund'
- 'Icefall / Serac / Ice Avalanche'
- 'Non-Ascent Illness'
- 'Gym / Artificial'
- 'Handhold/Foothold Broke'
- 'Inadequate Protection Available': Route is difficult or impossible to protect.
- subjective_risk_factors: array of strings describing the gear- and skill-based risk factors that contributed to the accident; strings must be one of the following:
- 'Piton/Ice Screw'
- 'Crampon Issues': Crampon difficulty — balling snow, putting on/off, or misuse (e.g. glissading with crampons).
- 'Poor Position'
- 'Off-route': Straying from the intended route; excludes failure to follow ranger/guide directions.
- 'Run Out'
- 'Inadequate Food/Water'
- 'No Helmet'
- 'Late in Day'
- 'Late Start'
- 'Fatigue'
- 'Object Dropped/Dislodged': Party-dislodged objects including rope and gear; excludes natural rockfall.
- 'Knot & Tie-in Error'
- 'No Backup or End Knot'
- 'Gear Broke'
- 'Inadequate Equipment': Missing or insufficient clothing/gear; excludes helmet (has its own category).
- 'Inadequate Protection / Pulled': No or insufficient protection placed.
- 'Anchor Failure / Error': Anchor building errors or failures; can co-occur with Rappel/Lowering Error.
- 'Stranded / Lost / Overdue'
- 'Belay Error'
- 'Rappel Error'
- 'Lowering Error'
- 'Pendulum'
- social_risk_factors: array of strings describing the social and psychological risk factors that contributed to the accident; strings must be one of the following:
- 'Rushed'
- 'Crowds'
- 'Party Separated'
- 'Gym Climber'
- 'Large Group'
- 'Distracted'
- 'Intoxicated'
- 'Miscommunication'
- 'Familiarity': Overconfidence in familiar terrain.
- 'Acceptance': Desire for group acceptance led to increased risk tolerance.
- 'Consistency': Overcommitment to a goal despite changing conditions.
- 'Expert Halo': Less experienced members deferred to a leader, accepting more risk than they would alone.
- 'Tracks/Scarcity': Perceived competition for first position or a closing window of opportunity.
- 'Social Facilitation': False sense of safety from the presence of other groups on the route.
- climbing_style: array of strings describing the climbing activity at the point of the accident; strings must be one of the following:
- 'Descent'
- 'Roped'
- 'Trad Climbing'
- 'Sport'
- 'Top-Rope'
- 'Aid & Big Wall Climbing'
- 'Unroped': Includes glissade and self-arrest incidents.
- 'Solo': Includes self-belayed climbing.
- 'Climbing Alone'
- 'Bouldering'
- 'Non-climbing'
- 'Alpine/Mountaineering'
- 'Ice Climbing'
- party_members: a nested object containing the following fields:
- name: the climber's name (only include people involved in the incident)
- age: the climber's age as a number (null if unknown)
- party_status: one of 'solo', 'party_member', 'party_leader', 'unknown'
- injury_level: one of 'no injury', 'minor injury', 'serious injury', 'fatal injury', 'unknown'
Return only the JSON object, no other text."
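Since the configured output file is a CSV, the array fields in each reply need to be flattened before they can sit on a single row. The sketch below shows one possible approach on a made-up reply. It assumes party_members comes back as an array of objects, which jsonlite simplifies to a data frame; example_json, collapse_field(), and report_row are hypothetical names, and the climber details are invented.

```r
library(jsonlite)

# Hypothetical reply following the field names in the prompt above
example_json <- '{
  "route_name": "Example Route",
  "immediate_cause": ["Fall on Rock"],
  "subjective_risk_factors": ["No Helmet", "Run Out"],
  "party_members": [
    {"name": "Climber A", "age": 34, "party_status": "party_leader", "injury_level": "serious injury"},
    {"name": "Climber B", "age": null, "party_status": "party_member", "injury_level": "no injury"}
  ]
}'

result <- fromJSON(example_json)

# Hypothetical helper: collapse an array field to one delimited string
collapse_field <- function(x) {
  if (is.null(x) || length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}

# One row per report; party_members is summarised rather than stored in full
report_row <- data.frame(
  route_name = result$route_name,
  immediate_cause = collapse_field(result$immediate_cause),
  subjective_risk_factors = collapse_field(result$subjective_risk_factors),
  n_party_members = nrow(result$party_members),
  n_fatal = sum(result$party_members$injury_level == "fatal injury")
)
```

A "; " delimiter is used here because several labels in the prompt already contain commas or slashes; any delimiter that does not appear in the label vocabulary would work equally well.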
## Test prompt with select articles ----
test_text <- as.character(paste(reports[10,12]))
# test_text
# test_result <- call_claude(test_text)
# write_json(test_result, path = paste0("data/api_archive/test_result_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".json"), auto_unbox = TRUE, pretty = TRUE)
Additional refinement will be done by testing the outputs across larger samples.
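One way such a larger sample run might be structured is sketched below. It pauses between calls, in the spirit of the delay_sec setting, and records failures instead of stopping at the first one. The function fake_extract() is a stand-in for call_claude() so the sketch runs without network access; it and run_sample() are hypothetical and are not part of the production pipeline.

```r
# Stand-in for call_claude() so the sketch runs without network access (hypothetical)
fake_extract <- function(text) {
  if (nchar(text) == 0) stop("empty report text")
  list(n_chars = nchar(text))
}

# Hypothetical helper: apply an extractor to each text, pausing between calls
run_sample <- function(texts, extract_fn, delay_sec = 0) {
  lapply(texts, function(txt) {
    # Capture errors as data so one bad report does not end the whole run
    out <- tryCatch(extract_fn(txt), error = function(e) list(error = conditionMessage(e)))
    Sys.sleep(delay_sec)
    out
  })
}

sample_texts <- c("First report text", "", "Third report text")
results <- run_sample(sample_texts, fake_extract, delay_sec = 0)

# Count failures so they can be inspected or retried
n_failed <- sum(vapply(results, function(r) !is.null(r$error), logical(1)))
```

With the real extractor, delay_sec would be set to the value from the API configuration rather than zero.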
Analysis by Nate Downer